 skin type


Prompt Triage: Structured Optimization Enhances Vision-Language Model Performance on Medical Imaging Benchmarks

Singhvi, Arnav, Bikia, Vasiliki, Aali, Asad, Chaudhari, Akshay, Daneshjou, Roxana

arXiv.org Artificial Intelligence

Vision-language foundation models (VLMs) show promise for diverse imaging tasks but often underperform on medical benchmarks. Prior efforts to improve performance include model finetuning, which requires large domain-specific datasets and significant compute, or manual prompt engineering, which is hard to generalize and often inaccessible to medical institutions seeking to deploy these tools. These challenges motivate interest in approaches that draw on a model's embedded knowledge while abstracting away dependence on human-designed prompts to enable scalable, weight-agnostic performance improvements. To explore this, we adapt the Declarative Self-improving Python (DSPy) framework for structured automated prompt optimization in medical vision-language systems through a comprehensive, formal evaluation. We implement prompting pipelines for five medical imaging tasks across radiology, gastroenterology, and dermatology, evaluating 10 open-source VLMs with four prompt optimization techniques. Optimized pipelines achieved a median relative improvement of 53% over zero-shot prompting baselines, with the largest gains ranging from 300% to 3,400% on tasks where zero-shot performance is low. These results highlight the substantial potential of applying automated prompt optimization to medical AI systems, demonstrating significant gains for vision-based applications requiring accurate clinical image interpretation. By reducing dependence on prompt design to elicit intended outputs, these techniques allow clinicians to focus on patient care and clinical decision-making. Furthermore, our experiments offer scalability and preserve data privacy, demonstrating performance improvement on open-source VLMs. We publicly release our evaluation pipelines to support reproducible research on specialized medical tasks, available at https://github.com/DaneshjouLab/prompt-triage-lab.
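As a rough illustration of why the largest relative gains appear on tasks where zero-shot performance is low, the following minimal sketch computes relative improvement over a baseline. The function name and the scores are hypothetical and are not taken from the paper; the formula is the usual (optimized - baseline) / baseline definition, which is assumed here rather than confirmed by the abstract.

```python
def relative_improvement(baseline: float, optimized: float) -> float:
    """Relative improvement of `optimized` over `baseline`, in percent.

    Assumes the usual definition (optimized - baseline) / baseline * 100.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative change")
    return (optimized - baseline) / baseline * 100.0


# Hypothetical scores, for illustration only (not results from the paper).
moderate_baseline = relative_improvement(0.50, 0.75)  # same metric, mid-range start
low_baseline = relative_improvement(0.02, 0.70)       # near-zero zero-shot score

print(f"moderate baseline: {moderate_baseline:.0f}%")
print(f"low baseline: {low_baseline:.0f}%")
```

Because the baseline sits in the denominator, a near-zero zero-shot score turns a moderate absolute gain into a relative gain in the thousands of percent.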



A Evaluation Information

Neural Information Processing Systems

To evaluate the effect that image corruptions have on face detection systems, we measure the precision of the detections on the corrupted images, using the detections from the clean image as ground truth. While this approach obviates the need for real ground truth bounding boxes, it is also a principled measurement strategy for our main research question. Because we are primarily interested in isolating how a system changes under a corruption, which is exactly what this method measures, this metric is better suited to our question than one based on real ground truth bounding boxes. To compute precision, we first observe the face detections on each clean image.
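The measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the box format, the greedy one-to-one matching, the IoU threshold of 0.5, and the handling of images with no detections are all assumptions introduced here.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_vs_clean(
    clean: List[Box], corrupted: List[Box], iou_threshold: float = 0.5
) -> float:
    """Precision of corrupted-image detections, treating the detections
    on the clean image as ground truth."""
    if not corrupted:
        # Precision is undefined with no detections; this convention
        # (1.0 only if the clean image also had none) is an assumption.
        return 0.0 if clean else 1.0
    unmatched = list(clean)
    true_positives = 0
    for det in corrupted:
        # Greedily match each detection to the best remaining clean box.
        best_idx, best_iou = -1, 0.0
        for i, ref in enumerate(unmatched):
            score = iou(det, ref)
            if score > best_iou:
                best_idx, best_iou = i, score
        if best_idx >= 0 and best_iou >= iou_threshold:
            true_positives += 1
            unmatched.pop(best_idx)  # each clean box matches at most once
    return true_positives / len(corrupted)


clean_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
corrupted_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
print(precision_vs_clean(clean_boxes, corrupted_boxes))
```

In this sketch a spurious detection introduced by the corruption lowers precision, while a face that is simply missed under corruption does not, since missed faces affect recall rather than precision.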


Beauty Beyond Words: Explainable Beauty Product Recommendations Using Ingredient-Based Product Attributes

Liu, Siliang, Suresh, Rahul, Banitalebi-Dehkordi, Amin

arXiv.org Artificial Intelligence

Accurate attribute extraction is critical for beauty product recommendations and building trust with customers. This remains an open problem, as existing solutions are often unreliable and incomplete. We present a system to extract beauty-specific attributes using end-to-end supervised learning based on beauty product ingredients. A key insight to our system is a novel energy-based implicit model architecture. We show that this implicit model architecture offers significant benefits in terms of accuracy, explainability, robustness, and flexibility. Furthermore, our implicit model can be easily fine-tuned to incorporate additional attributes as they become available, making it more useful in real-world applications. We validate our model on a major e-commerce skincare product catalog dataset and demonstrate its effectiveness. Finally, we showcase how ingredient-based attribute extraction contributes to enhancing the explainability of beauty recommendations.


PatchAlign: Fair and Accurate Skin Disease Image Classification by Alignment with Clinical Labels

Aayushman, Gaddey, Hemanth, Mittal, Vidhi, Chawla, Manisha, Gupta, Gagan Raj

arXiv.org Artificial Intelligence

Deep learning models have achieved great success in automating skin lesion diagnosis. However, the ethnic disparity in these models' predictions needs to be addressed before deploying them. We introduce a novel approach, PatchAlign, to enhance skin condition image classification accuracy and fairness by aligning with clinical text representations of skin conditions. PatchAlign uses Graph Optimal Transport (GOT) Loss as a regularizer to perform cross-domain alignment. The representations obtained are robust and generalize well across skin tones, even with limited training samples. To reduce the effect of noise and artifacts in clinical dermatology images, we propose a learnable Masked Graph Optimal Transport for cross-domain alignment that further improves fairness metrics. We compare our model to the state-of-the-art FairDisCo on two skin lesion datasets with different skin types: Fitzpatrick17k and Diverse Dermatology Images (DDI). PatchAlign enhances the accuracy of skin condition image classification by 2.8% (in-domain) and 6.2% (out-domain) on Fitzpatrick17k, and 4.2% (in-domain) on DDI compared to FairDisCo. Additionally, it consistently improves the fairness of true positive rates across skin tones. The source code for the implementation is available at the following GitHub repository: https://github.com/aayushmanace/PatchAlign24, enabling easy reproduction and further experimentation.


Revisiting Skin Tone Fairness in Dermatological Lesion Classification

Kalb, Thorsten, Kushibar, Kaisar, Cintas, Celia, Lekadir, Karim, Diaz, Oliver, Osuala, Richard

arXiv.org Artificial Intelligence

Addressing fairness in lesion classification from dermatological images is crucial due to variations in how skin diseases manifest across skin tones. However, the absence of skin tone labels in public datasets hinders building a fair classifier. To date, such skin tone labels have been estimated prior to fairness analysis in independent studies using the Individual Typology Angle (ITA). Briefly, ITA calculates an angle based on pixels extracted from skin images taking into account the lightness and yellow-blue tints. These angles are then categorised into skin tones that are subsequently used to analyse fairness in skin cancer classification. In this work, we review and compare four ITA-based approaches of skin tone classification on the ISIC18 dataset, a common benchmark for assessing skin cancer classification fairness in the literature. Our analyses reveal a high disagreement among previously published studies demonstrating the risks of ITA-based skin tone estimation methods. Moreover, we investigate the causes of such large discrepancy among these approaches and find that the lack of diversity in the ISIC18 dataset limits its use as a testbed for fairness analysis. Finally, we recommend further research on robust ITA estimation and diverse dataset acquisition with skin tone annotation to facilitate conclusive fairness assessments of artificial intelligence tools in dermatology.
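For readers unfamiliar with ITA, the sketch below shows the standard formula, ITA = arctan((L* - 50) / b*) × 180/π, where L* is lightness and b* is the yellow-blue component in CIELAB colour space. The category thresholds used here are one commonly cited scheme and are an assumption; as the abstract notes, published studies differ in how they select pixels and bin angles, so this should not be read as the method of any particular study.

```python
import math


def individual_typology_angle(l_star: float, b_star: float) -> float:
    """ITA in degrees from CIELAB lightness (L*) and yellow-blue (b*).

    atan2 is used so that b* == 0 does not raise; for the positive b*
    values typical of skin it equals arctan((L* - 50) / b*).
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


def ita_category(ita_degrees: float) -> str:
    """Bin an ITA value into a skin tone category.

    Thresholds follow one commonly cited scheme (assumed here);
    other studies use different cut-offs.
    """
    if ita_degrees > 55:
        return "very light"
    if ita_degrees > 41:
        return "light"
    if ita_degrees > 28:
        return "intermediate"
    if ita_degrees > 10:
        return "tan"
    if ita_degrees > -30:
        return "brown"
    return "dark"


# Hypothetical mean L* and b* values for a patch of skin pixels.
angle = individual_typology_angle(70.0, 15.0)
print(f"{angle:.1f} degrees -> {ita_category(angle)}")
```

Because the category depends on which pixels are averaged and which thresholds are applied, small changes in either can move an image across a boundary, which is consistent with the disagreement between approaches that the abstract reports.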


Robustness Disparities in Face Detection

Dooley, Samuel, Wei, George Z., Goldstein, Tom, Dickerson, John P.

arXiv.org Artificial Intelligence

Facial analysis systems have been deployed by large companies and critiqued by scholars and activists for the past decade. Many existing algorithmic audits examine the performance of these systems on later stage elements of facial analysis systems like facial recognition and age, emotion, or perceived gender prediction; however, a core component to these systems has been vastly understudied from a fairness perspective: face detection, sometimes called face localization. Since face detection is a pre-requisite step in facial analysis systems, the bias we observe in face detection will flow downstream to the other components like facial recognition and emotion prediction. Additionally, no prior work has focused on the robustness of these systems under various perturbations and corruptions, which leaves open the question of how various people are impacted by these phenomena. We present the first of its kind detailed benchmark of face detection systems, specifically examining the robustness to noise of commercial and academic models. We use both standard and recently released academic facial datasets to quantitatively analyze trends in face detection robustness. Across all the datasets and systems, we generally find that photos of individuals who are masculine presenting, older, of darker skin type, or have dim lighting are more susceptible to errors than their counterparts in other identities.


Improving dermatology classifiers across populations using images generated by large diffusion models

Sagers, Luke W., Diao, James A., Groh, Matthew, Rajpurkar, Pranav, Adamson, Adewole S., Manrai, Arjun K.

arXiv.org Artificial Intelligence

Dermatological classification algorithms developed without sufficiently diverse training data may generalize poorly across populations. While intentional data collection and annotation offer the best means for improving representation, new computational approaches for generating training data may also aid in mitigating the effects of sampling bias. In this paper, we show that DALL·E 2, a large-scale text-to-image diffusion model, can produce photorealistic images of skin disease across skin types. Using the Fitzpatrick 17k dataset as a benchmark, we demonstrate that augmenting training data with DALL·E 2-generated synthetic images improves classification of skin disease overall and especially for underrepresented groups.


AI skin cancer diagnoses risk being less accurate for dark skin – study

#artificialintelligence

AI systems being developed to diagnose skin cancer run the risk of being less accurate for people with dark skin, research suggests. The potential of AI has led to developments in healthcare, with some studies suggesting image recognition technology based on machine learning algorithms can classify skin cancers as successfully as human experts. NHS trusts have begun exploring AI to help dermatologists triage patients with skin lesions. But researchers say more needs to be done to ensure the technology benefits all patients, after finding that few freely available image databases that could be used to develop or "train" AI systems for skin cancer diagnosis contain information on ethnicity or skin type. Those that do have very few images of people with dark skin.


Comparing Human and Machine Bias in Face Recognition

Dooley, Samuel, Downing, Ryan, Wei, George, Shankar, Nathan, Thymes, Bradon, Thorkelsdottir, Gudrun, Kurtz-Miott, Tiye, Mattson, Rachel, Obiwumi, Olufemi, Cherepanova, Valeriia, Goldblum, Micah, Dickerson, John P, Goldstein, Tom

arXiv.org Artificial Intelligence

Much recent research has uncovered and discussed serious concerns of bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias but have two major challenges: the audits (1) use facial recognition datasets which lack quality metadata, like LFW and CelebA, and (2) do not compare their observed algorithmic bias to the biases of their human alternatives. In this paper, we release improvements to the LFW and CelebA datasets which will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g. identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions that we administered to various algorithms and a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better at the verification task, generally obtain lower accuracy rates on dark-skinned or female subjects for both tasks, and obtain higher accuracy rates when their demographics match that of the question. Computer models are observed to achieve a higher level of accuracy than the survey participants on both tasks and exhibit bias to similar degrees as the human survey participants.